-
Notifications
You must be signed in to change notification settings - Fork 8.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Logging]metrics ops logs include event loop delay histogram data #120451
[Logging]metrics ops logs include event loop delay histogram data #120451
Conversation
const logMeta = getEcsOpsMetricsLog(createBaseOpsMetrics()); | ||
expect(logMeta).toMatchInlineSnapshot(` | ||
const logMeta = getEcsOpsMetricsLog(createMockOpsMetrics(testMetrics)); | ||
expect(logMeta.meta).toMatchInlineSnapshot(` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The snapshot was getting too big to easily notice a change, so I've changed this assertion to only test the meta
field.
const eventLoopDelayVal = process?.event_loop_delay; | ||
const eventLoopDelayValMsg = eventLoopDelayVal | ||
? `delay: ${numeral(process?.event_loop_delay).format('0.000')}` | ||
? `mean delay: ${numeral(process?.event_loop_delay).format('0.000')}` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Now that we have the histogram percentiles in the log message, we need to let folks know that the first delay entry is the mean
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: I wonder if it would be more readable to combine the values in the existing delay
, similar to how it is structured in the API?
"memory: 100.0B uptime: 0:00:01 load: [10.00,20.00,30.00] delay: { mean: 50.000, 50: 50.000, 95: 95.000, 99: 99.000 }"
(Edit: Mostly just trying to think how we can keep this as succinct as possible)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder if it would be more readable to combine the values in the existing delay, similar to how it is structured in the API.
We could but that would be a breaking change for folks ingesting the log message. ATM, if someone was parsing the message string they might expect delay: number
and end up with delay: [Object]
.
Admittedly, this PR also changes the message string but keeps the delay type the same. Appending the additional data seems like a less risky option here.
Admittedly, ingesting log message data isn't the best approach and folks should be using the meta
instead. However, we removed meta
from all the logs expect for debug
mode so the extra info wouldn't be available for other levels.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We could but that would be a breaking change for folks ingesting the log message.
Yeah, although I suppose we are adding EDIT: Actually I guess not -- I misread, we have mean delay
so a grok pattern against just delay
could match two results now (both mean and histogram).mean delay
and delay histogram
🙂
However, in the past we have not considered the message format to be stable, and have made breaking changes to it in minors anyway. So I am not super concerned about this either way.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
self review
Pinging @elastic/kibana-core (Team:Core) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
Are we planning on updating the /status
page to display these additional values btw?
@@ -75,7 +95,7 @@ export function getEcsOpsMetricsLog(metrics: OpsMetrics) { | |||
}; | |||
|
|||
return { | |||
message: `${processMemoryUsedInBytesMsg}${uptimeValMsg}${loadValsMsg}${eventLoopDelayValMsg}`, | |||
message: `${processMemoryUsedInBytesMsg}${uptimeValMsg}${loadValsMsg}${eventLoopDelayValMsg}${eventLoopDelayHistMsg}`, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
NIT/optional: I wonder if
message: [
processMemoryUsedInBytesMsg,
uptimeValMsg,
loadValsMsg,
eventLoopDelayValMsg,
eventLoopDelayHistMsg,
].join('');
isn't more readable
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are we planning on updating the /status page to display these additional values btw?
It's not in the scope for this issue but I don't see why not!
How would we display the data? We could use the same cards we use now and add each entry individually or perhaps we could add another level to the metrics cards with details that would expand if a "show more" or "details" was clicked?
Maybe a simpler option would be to only show say the 50th, 95th and 99th percentiles in the same delay card, at the bottom in smaller text?
Update: I created #120667 for discussion on the approach.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are we planning on updating the /status page to display these additional values btw?
++ This would be a nice addition
const eventLoopDelayVal = process?.event_loop_delay; | ||
const eventLoopDelayValMsg = eventLoopDelayVal | ||
? `delay: ${numeral(process?.event_loop_delay).format('0.000')}` | ||
? `mean delay: ${numeral(process?.event_loop_delay).format('0.000')}` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: I wonder if it would be more readable to combine the values in the existing delay
, similar to how it is structured in the API?
"memory: 100.0B uptime: 0:00:01 load: [10.00,20.00,30.00] delay: { mean: 50.000, 50: 50.000, 95: 95.000, 99: 99.000 }"
(Edit: Mostly just trying to think how we can keep this as succinct as possible)
💚 Build Succeeded
Metrics [docs]
History
To update your PR or re-run it, just comment with: |
💔 Backport failed
Successful backport PRs will be merged automatically after passing CI. To backport manually run: |
Friendly reminder: Looks like this PR hasn’t been backported yet. |
Friendly reminder: Looks like this PR hasn’t been backported yet. |
1 similar comment
Friendly reminder: Looks like this PR hasn’t been backported yet. |
💔 Backport failed
Successful backport PRs will be merged automatically after passing CI. To backport manually run: |
Resolves #119069
metrics.ops
logs/api/status
API to allow Monitoring UI to render more detailed metrics.Logs
debug logs
"[%date][%level][%logger][%meta]---%message"
json layout prettified
/api/status
network response body
Checklist
Delete any items that are not applicable to this PR.
How to test this:
Pull this branch and run kibana with
trial
license and the following logging configuration:After kibana starts up, wait a few moments to see the output of the
metrics.ops
logs. Observe that theeventLoopDelayHistogram
appears in the log message and under theprocess
field in the logmeta
.Navigate to the status page:
localhost:5601/<basepath>/status
.Open the browsers' dev tools and inspect the network response from the
status
api endpoint. Observe that theeventLoopDelay
histogram percentiles appears under themetric/process
andmetric/processes
keys.Risk Matrix
metrics.ops
message to parse data is risky as the message string has changedRelease note
Adds 50th, 95th and 99th event loop delay histogram percentiles to
metrics.ops
logger message and meta field surfaced when logging in debug mode or injson
layout.